Search results for "Suffix array"

showing 10 items of 16 documents

Efficient Algorithms for Sequence Analysis with Entropic Profiles

2017

Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign …

0301 basic medicine, Compressed suffix array, Theoretical computer science, Entropy, Suffix tree, 0206 medical engineering, Generalized suffix tree, 02 engineering and technology, String searching algorithm, Information theory, law.invention, 03 medical and health sciences, law, Genetics, Animals, Humans, Mathematics, Applied Mathematics, Suffix array, Computational Biology, DNA, Sequence Analysis DNA, Data structure, 030104 developmental biology, Suffix, Alignment free Entropy Sequence analysis Sequence comparison, Algorithms, 020602 bioinformatics, Biotechnology, IEEE/ACM Transactions on Computational Biology and Bioinformatics
researchProduct

Parallel and Space-Efficient Construction of Burrows-Wheeler Transform and Suffix Array for Big Genome Data

2016

Next-generation sequencing technologies have led to the sequencing of more and more genomes, propelling related research into the era of big data. In this paper, we present ParaBWT, a parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data. In ParaBWT, we have investigated a progressive construction approach that builds the BWT of single genome sequences in linear space with a small constant factor. This approach has been further parallelized using multi-threading based on a master-slave coprocessing model. After obtaining the BWT, the suffix array is constructed in a memory-efficient manner. The performance of ParaBWT has b…

0301 basic medicine, Theoretical computer science, Burrows–Wheeler transform, Computer science, Genomics, Data_CODINGANDINFORMATIONTHEORY, Parallel computing, Genome, law.invention, 03 medical and health sciences, law, Genetics, Humans, Ensembl, Multi-core processor, Applied Mathematics, Linear space, Suffix array, Chromosome Mapping, High-Throughput Nucleotide Sequencing, Genomics, Sequence Analysis DNA, 030104 developmental biology, Algorithms, Biotechnology, Reference genome, IEEE/ACM Transactions on Computational Biology and Bioinformatics
researchProduct
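
The ParaBWT abstract above describes building the BWT first and deriving the suffix array from it. The following is a minimal in-memory sketch of that relationship, not ParaBWT's parallel, space-efficient method. It assumes the text ends with a unique sentinel `$` that is smaller than every other character, and all function names are hypothetical.

```python
from collections import Counter


def suffix_array(text):
    """Naive suffix array: sort suffix start positions by suffix (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT[r] is the character preceding the r-th smallest suffix (cyclically)."""
    return "".join(text[i - 1] for i in sa)


def suffix_array_from_bwt(bwt):
    """Recover the suffix array from the BWT by walking the LF-mapping.

    Assumes the original text ended with a unique, smallest sentinel,
    so row 0 of the sorted suffixes is the sentinel suffix at position n - 1.
    """
    n = len(bwt)
    counts = Counter(bwt)
    # first[c] = number of characters in the BWT strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # lf[row] = row of the suffix that starts one position earlier in the text
    seen = Counter()
    lf = [0] * n
    for row, c in enumerate(bwt):
        lf[row] = first[c] + seen[c]
        seen[c] += 1
    sa = [0] * n
    row = 0
    for pos in range(n - 1, -1, -1):
        sa[row] = pos
        row = lf[row]
    return sa


text = "banana$"
sa = suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
print(sa, bwt, suffix_array_from_bwt(bwt))
```

The naive sort takes quadratic time and space in the worst case, so this sketch only shows how the two structures determine each other; it says nothing about the memory behaviour reported in the paper.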

Computing the Original eBWT Faster, Simpler, and with Less Memory

2021

Mantaci et al. [TCS 2007] defined the \(\mathrm {eBWT}\) to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. However, since this introduction, it has been used more generally to describe any \(\mathrm {BWT}\) of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original \(\mathrm {eBWT}\), which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the \(\mathrm {BWT}\) of a single string that uses …

2019-20 coronavirus outbreak, Speedup, String collections, Big BWT, Settore INF/01 - Informatica, Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), String (computer science), Suffix array, Order (ring theory), omega-order, Quantitative Biology::Genomics, Burrows-Wheeler-Transform, Burrows-Wheeler-Transform String collections SAIS Big BWT prefix-free parsing extended BWT, law.invention, Combinatorics, prefix-free parsing, Simple (abstract algebra), law, SAIS, SAIS algorithm, Independence (probability theory), extended BWT, Mathematics
researchProduct

On the construction of classes of suffix trees for square matrices: Algorithms and applications

1995

Given an n × n TEXT matrix with entries defined over an ordered alphabet σ, we introduce 4^(n−1) classes of index data structures for TEXT. Those indices are informally the two-dimensional analog of the suffix tree of a string [15], allowing on-line searches and statistics to be performed on TEXT. We provide one simple algorithm that efficiently builds any chosen index in those classes in O(n^2 log n) worst case time using O(n^2) space. The algorithm can be modified to require optimal O(n^2) expected time for bounded σ.

Combinatorics, Compressed suffix array, law, Suffix tree, String (computer science), Generalized suffix tree, Suffix array, Suffix, Algorithm, FM-index, law.invention, Mathematics, Longest common substring problem
researchProduct

On-line construction of two-dimensional suffix trees

1997

We present a new technique, which we refer to as implicit updates, based on which we obtain: (a) an algorithm for the on-line construction of the Lsuffix tree of an n × n matrix A (this data structure, described in [13], is the two-dimensional analog of the suffix tree of a string); (b) simple algorithms implementing primitive operations for LZ1-type on-line lossless image compression methods. Those methods, recently introduced by Storer [35], are generalizations of LZ1-type compression methods for strings (see also [24, 31]). For the problem in (a), we get nearly an order of magnitude improvement over algorithms that can be derived from known techniques [13]. For the problem in (b), we do …

Combinatorics, Succinct data structure, Compressed suffix array, Tree (data structure), Computer science, law, Suffix tree, String (computer science), Generalized suffix tree, Suffix, Data compression, law.invention
researchProduct

Linear-size suffix tries

2016

Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string w of length n = |w|, a suffix tree for w takes O(n) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs are substrings encoded as pointers to w (or equivalent information). On the contrary, the arcs of the suffix trie are labeled by single symbols but there can be Θ(n^2) nodes and lin…

Compressed suffix array, General Computer Science, Suffix tree, [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS], Generalized suffix tree, 0102 computer and information sciences, 02 engineering and technology, Data_CODINGANDINFORMATIONTHEORY, Text indexing, 01 natural sciences, Y-fast trie, law.invention, Longest common substring problem, Theoretical Computer Science, Combinatorics, Suffix tree, law, Factor and suffix automata, 0202 electrical engineering electronic engineering information engineering, Data_FILES, Arithmetic, Factor and suffix automata; Pattern matching; Suffix tree; Text indexing; Theoretical Computer Science; Computer Science (all), Pattern matching, Mathematics, Settore INF/01 - Informatica, X-fast trie, Computer Science (all), LCP array, 010201 computation theory & mathematics, 020201 artificial intelligence & image processing, FM-index
researchProduct

Inducing the Lyndon Array

2019

In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $\sigma + O(1)$ words of working space, where $n$ is the length of the text and $\sigma$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not onl…

FOS: Computer and information sciences, 050101 languages & linguistics, Computer science, Computation, Induced suffix sorting, 02 engineering and technology, Space (mathematics), law.invention, Suffix sorting, law, Suffix array, Computer Science - Data Structures and Algorithms, 0202 electrical engineering electronic engineering information engineering, Data_FILES, Preprocessor, Data Structures and Algorithms (cs.DS), 0501 psychology and cognitive sciences, Computer Science::Data Structures and Algorithms, Time complexity, Settore ING-INF/05 - Sistemi Di Elaborazione Delle Informazioni, Settore INF/01 - Informatica, 05 social sciences, Lightweight algorithm, Suffix array, Sigma, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Induced suffix sorting; Lightweight algorithms; Lyndon array; Suffix array, Working space, Lyndon array, Lightweight algorithms, 020201 artificial intelligence & image processing, Algorithm, Computer Science::Formal Languages and Automata Theory
researchProduct
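
For readers unfamiliar with the Lyndon array mentioned in the entry above, the sketch below computes it from suffix ranks, using the known fact that the longest Lyndon word starting at position i ends just before the next position whose suffix is lexicographically smaller. This is the kind of suffix-sorting-based preprocessing the abstract contrasts itself with, not the induced-sorting variant the paper proposes; the function name and the naive suffix sort are assumptions made for illustration.

```python
def lyndon_array(text):
    """Lyndon array via suffix ranks and a next-smaller-value stack scan.

    lam[i] is the length of the longest Lyndon word starting at i, which
    equals (next position j > i whose suffix is smaller than suffix i) - i,
    or n - i when no such position exists.
    """
    n = len(text)
    # Naive suffix sorting, used here only to obtain ranks.
    sa = sorted(range(n), key=lambda i: text[i:])
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r

    lam = [0] * n
    stack = []  # candidate positions to the right, ranks increasing toward the top
    for i in range(n - 1, -1, -1):
        # Discard positions whose suffix is larger than suffix i.
        while stack and rank[stack[-1]] > rank[i]:
            stack.pop()
        lam[i] = (stack[-1] if stack else n) - i
        stack.append(i)
    return lam


print(lyndon_array("banana"))
```

Apart from the naive sort, the stack scan is linear, but it keeps the rank array and the stack in memory, which is the O(n)-word working space the abstract refers to.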

Sorting suffixes of a text via its Lyndon Factorization

2013

The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. It is used, for instance, in the construction of the Burrows-Wheeler transform and the suffix array, both widely used in several fields of Computer Science. For this reason, much recent research has been devoted to finding new strategies for performing such sorting effectively. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can b…

FOS: Computer and information sciences, BWT, Lyndon Factorization, Settore INF/01 - Informatica, Sorting Suffixes; Lyndon Factorization; Lyndon Words, Suffix array, Computer Science - Data Structures and Algorithms, Data_FILES, Data Structures and Algorithms (cs.DS), Lyndon word, Sorting suffixe, Sorting Suffixes, Lyndon Words
researchProduct
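
The property stated in the abstract above (suffixes inside a Lyndon factor keep their relative order when extended to suffixes of the whole word) can be checked on small inputs. The sketch below uses Duval's standard algorithm for the factorization and a brute-force comparison; it is not the sorting technique proposed in the paper, and the function names are hypothetical.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            # Restart the comparison on a strict increase, otherwise advance it.
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit all repetitions of the Lyndon word of length j - k found at i.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def local_order_is_preserved(s):
    """Brute-force check: inside each factor, the order of local suffixes
    matches the order of the corresponding suffixes of the whole string."""
    start = 0
    for factor in lyndon_factorization(s):
        end = start + len(factor)
        positions = range(start, end)
        local = sorted(positions, key=lambda p: s[p:end])
        whole = sorted(positions, key=lambda p: s[p:])
        if local != whole:
            return False
        start = end
    return True


print(lyndon_factorization("banana"), local_order_is_preserved("banana"))
```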

Uncommon Suffix Tries

2011

Common assumptions on the source producing the words inserted in a suffix trie with $n$ leaves lead to a $\log n$ height and saturation level. We provide an example of a suffix trie whose height increases faster than a power of $n$ and another one whose saturation level is negligible with respect to $\log n$. Both are built from VLMC (Variable Length Markov Chain) probabilistic sources; they are easily extended to families of sources having the same properties. The first example corresponds to a "logarithmic infinite comb" and enjoys a non-uniform polynomial mixing. The second one corresponds to a "factorial infinite comb" for which mixing is uniform and exponential.

FOS: Computer and information sciences, Compressed suffix array, Polynomial, Logarithm, General Mathematics, Suffix tree, variable length Markov chain, [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS], Generalized suffix tree, probabilistic source, 0102 computer and information sciences, 02 engineering and technology, suffix trie, 01 natural sciences, law.invention, Combinatorics, law, Computer Science - Data Structures and Algorithms, Trie, FOS: Mathematics, 0202 electrical engineering electronic engineering information engineering, Data Structures and Algorithms (cs.DS), Mixing (physics), [ INFO.INFO-DS ] Computer Science [cs]/Data Structures and Algorithms [cs.DS], Mathematics, Discrete mathematics, Applied Mathematics, Probability (math.PR), 020206 networking & telecommunications, suffix trie., Computer Graphics and Computer-Aided Design, [MATH.MATH-PR]Mathematics [math]/Probability [math.PR], 010201 computation theory & mathematics, mixing properties, 60J05 37E05, Suffix, [ MATH.MATH-PR ] Mathematics [math]/Probability [math.PR], Mathematics - Probability, Software
researchProduct

Lightweight LCP construction for very large collections of strings

2016

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, makes it possible to efficiently compute some combinatorial properties of a string that are useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is reali…

FOS: Computer and information sciences, Computer science, Computation, 0102 computer and information sciences, 02 engineering and technology, Parallel computing, 01 natural sciences, Generalized Suffix Array, Theoretical Computer Science, law.invention, law, Computational Theory and Mathematic, Computer Science - Data Structures and Algorithms, Extended Burrows-Wheeler Transform, Data_FILES, 0202 electrical engineering electronic engineering information engineering, Discrete Mathematics and Combinatorics, Data Structures and Algorithms (cs.DS), Discrete Mathematics and Combinatoric, Auxiliary memory, Longest Common Prefix Array; Extended Burrows-Wheeler Transform; Generalized Suffix Array, String (computer science), LCP array, Suffix array, Data structure, Computational Theory and Mathematics, 010201 computation theory & mathematics, Longest Common Prefix Array, 020201 artificial intelligence & image processing, Journal of Discrete Algorithms
researchProduct
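
As background for the longest common prefix (LCP) array discussed in the entry above, the sketch below computes it for a single string with the classic in-memory algorithm of Kasai et al. It is not extLCP, which targets very large string collections; the function names are hypothetical, and `lcp[r]` is taken to mean the LCP of the suffixes at ranks r - 1 and r, with `lcp[0] = 0`.

```python
def suffix_array(text):
    """Naive suffix array, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai-style LCP construction in O(n) time given the suffix array."""
    n = len(text)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r

    lcp = [0] * n
    h = 0  # length of the current common prefix, reused across iterations
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


text = "banana$"
sa = suffix_array(text)
print(sa, lcp_array(text, sa))
```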